In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import plotly.graph_objs as go
import bs4
import requests
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import ExtraTreesRegressor
In [2]:
dt=pd.read_csv("https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv")

I. Overview data

First we will look at the structure of the data and interpret the meaning of feature columns, and then evaluate some problems in the data. Data includes 5043 rows of different movies and 28 columns that are the information related to the movies.

  • color: color movies or black and white movies
  • director_name
  • num_critic_for_reviews: number of critics. The popularity of a movie in social network can be largely affected by the critics
  • duration: the time length of the movies in minutes
  • director_facebook_likes: number of likes from the facebook of the director
  • actor_1_facebook_likes:number of like from the facebook of actor/actress that has the highest number of like for all cast members
  • actor_2_facebook_likes:number of like from the facebook of actor/actress that has the second lagest number of like for all cast members
  • actor_3_facebook_likes: number of like from the facebook of actor/actress that has the third lagest number of like for all cast members
  • cast_total_facebook_likes: number of likes from the facebook of all the cast members.
  • movie_facebook_likes: number of like from facebook of the movie
  • actor_1_name: name of actor/actress that has the highest number of like from facebook for all cast members.
  • actor_2_name: name of actor/actress that has the second largest number of like from facebook for all cast members.
  • actor_3_name: name of actor/actress that has the third largest number of like from facebook for all cast members.
  • gross: gross earning of the movie
  • genres: movies genres
  • movie_title
  • num_voted_users: number of user that voted for the movie
  • facenumber_in_poster: number of human faces in movie posters
  • plot_keywords: keywords from the movie content
  • movie_imdb_link: the link of the movie on imdb website
  • num_user_for_reviews: number of users that write review for the movie
  • language: language in the movie
  • country: country which the movie was produced
  • content_rating: which age group is suitable to view such as R, PG, PG-13, etc.
  • budget: budget to produce the movie
  • title_year: the year that movie was produced
  • imdb_score: imdb rating (targer variable)
  • aspect_ratio: the widescreen of the movie (16:9 for example)

There is some data problems that we need to handle:

  1. Columns with object type: these columns are text or category type, not numerical, so we need to transform it into dummy variables.
  2. Object columns with many different values, so we need to group these values to avoid creating to many dummy variabels.
  3. Missing data (21 columns).
  4. Remove duplicates.
In [3]:
''' Overview data '''

print(dt.info())

print('\nNumber of column type:\n', dt.dtypes.value_counts()) 

''' Missing values''' 

# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("\nDataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.\n", mis_val_table_ren_columns)
        
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

# Missing value
misssing_col= missing_values_table(dt)


''' Check duplicate'''

def dup_remove(data):
    print('Remove', dt.duplicated().sum(),'dupicates')
    
    return data.drop_duplicates()

dt= dup_remove(dt)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
None

Number of column type:
 float64    13
object     12
int64       3
dtype: int64

Dataframe has 28 columns.
There are 21 columns that have missing values.
                          Missing Values  % of Total Values
gross                               884               17.5
budget                              492                9.8
aspect_ratio                        329                6.5
content_rating                      303                6.0
plot_keywords                       153                3.0
title_year                          108                2.1
director_name                       104                2.1
director_facebook_likes             104                2.1
num_critic_for_reviews               50                1.0
actor_3_name                         23                0.5
actor_3_facebook_likes               23                0.5
num_user_for_reviews                 21                0.4
color                                19                0.4
duration                             15                0.3
facenumber_in_poster                 13                0.3
actor_2_name                         13                0.3
actor_2_facebook_likes               13                0.3
language                             12                0.2
actor_1_name                          7                0.1
actor_1_facebook_likes                7                0.1
country                               5                0.1
Remove 45 dupicates

II. Data Visualization

In [4]:
# Number of unique classes in each object column
print('\n Number of unique value:\n', dt.apply(pd.Series.nunique, axis = 0).sort_values())
 Number of unique value:
 color                           2
content_rating                 18
facenumber_in_poster           19
aspect_ratio                   22
language                       47
country                        65
imdb_score                     78
title_year                     91
duration                      191
director_facebook_likes       435
budget                        439
num_critic_for_reviews        528
movie_facebook_likes          876
actor_1_facebook_likes        878
actor_3_facebook_likes        906
genres                        914
actor_2_facebook_likes        917
num_user_for_reviews          954
actor_1_name                 2097
director_name                2398
actor_2_name                 3032
actor_3_name                 3521
cast_total_facebook_likes    3978
gross                        4035
plot_keywords                4760
num_voted_users              4826
movie_title                  4917
movie_imdb_link              4919
dtype: int64
In [5]:
def unique_count_plt(data, col_list):
    '''
    Bar plot of unique values. Also show number of missing values to compare.
    data: dataframe
    col_list: list column name want to plot

    '''
    for col in col_list:
        count= data[col].value_counts(dropna= False)
        percent= count*100/count.sum()    
        percent.plot.bar(figsize=(15,7)) # series function plot bar
        plt.title('% unique values in '+ col+  ' column')
        plt.show()
unique_count_plt(dt, ['color','content_rating','country', 'language','facenumber_in_poster','aspect_ratio','title_year'])
In [6]:
def box_plt_category(data, col_list, target_col):
    '''
    Graph box plot of each column in descending order of median 
    data: pandas data frame
    col_list: list of column name that want to be show on x_axis
    target_col: column name of target variable that want to be show on y_axis
    '''
    for col in col_list:
        
        grouped = data.loc[:,[col, target_col]].groupby([col]).median().sort_values(by= target_col, ascending= False)
        plt.figure(figsize=(20, 6))
        chart= sns.boxplot(x=col, y=target_col, data=data, order=grouped.index)
        #chart= sns.violinplot(x=col, y=target_col, data=data, order=grouped.index)
        #sns.stripplot(x=col, y=target_col, data=data, order=grouped.index)
        chart.set_xticklabels(chart.get_xticklabels(), rotation=90, horizontalalignment='right')
        chart.set_title('Box plot of ' + col, fontsize=16)
        
box_plt_category(dt, ['color','content_rating', 'country', 'language','facenumber_in_poster','title_year','aspect_ratio'], 'imdb_score')

The result of checking columns that has the number of unique values < 100 (except target column imdb_score):

  • Color column has 2 unique values. Most of movies are color movies (95%). Black and white accounts for 4 % and a small portion of Nan values. Black movies have higher median imdb score and most of them are > 5 while many color movies is rated below 5 (bad movie).
  • Content_rating column has 18 types: R (41%), PG-13(29%), PG (14%), missing value (6%), and other values account a small proportion. The median rating is not different significantly between each type.
  • Country column has 65 different values. Most movies come from USA and UK (total 84% for both), including a large amount of bad movies. The median imdb scores for these two countries are not the highest. Countries such as Portland, Iran, Sweden and Brazil, produced a small number of movies with higher median imdb scores.
  • Language column has 47 different languages. Most movies are English (93%) because US and Uk are major countries for movie making. Many english movies has rating <5
  • facenumber_in_poster has 19 different number. Most movies have less than 3 human faces. Median imbd score is also higher for number of faces < 3.
  • aspect_ratio: most of them are 2.35 and 1.85, missing values (6.5%). The median rating is not different significantly between aspect_ratio.
  • tilte_year: more than 90% of number movie was produced after 1995 when cinema industry thrived, misisng values accounts 2.1%. Along with the boom of movie industry, after 2000 there are many movies with low imdb score.
In [7]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

def scatter_plt_numerical(data, col_list, target_col, col_movie_title):
    '''
    Scatter plot  
    data: pandas data frame
    col_list: list of column name that want to be show on y_axis
    target_col: column name of target variable that want to be show on x_axis
    '''
    for col in col_list:
        
        
        dict_of_fig = dict({
            
            "data": [{'y': data[target_col] ,
                      'x': data[col] ,
                      'mode': 'markers', # marker= scatter
                      'marker': dict(size=8, color=data.movie_facebook_likes, colorbar= dict(title="movie_facebook_likes"), colorscale='Viridis', showscale=True),
                      "text" :  data[col_movie_title]
                    }],        

            "layout": {"title": {"text": "Scatter plot between Imdb_score and " + col},
                      "xaxis_title":{"text": col },
                      "yaxis_title":{"text": "imdb_score" },
                        
                      }
                            })
        fig = go.Figure(dict_of_fig)

        fig.show()

list_col= ['director_facebook_likes','num_critic_for_reviews','duration', 'actor_1_facebook_likes', 'actor_2_facebook_likes',\
           'actor_3_facebook_likes','cast_total_facebook_likes','gross','num_voted_users',\
           'num_user_for_reviews','budget']
       
scatter_plt_numerical(dt, list_col, 'imdb_score','movie_title')
In [8]:
# color= ['lighseagreen','pink','red','blue','lavender','papayawhip','yellow','grey','cyan','orange','orchid']
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

def violin_plot(data, col_list):
    '''
    Violin plot: plot density distribution of continuous variables
    '''
    for col in col_list:
        fig = go.Figure(go.Violin(y = data[col],box_visible=True,line_color='black',\
                          fillcolor='lightseagreen', opacity=0.6,meanline_visible = True, x0=col))
        fig.show()
        
list_col= ['movie_facebook_likes','director_facebook_likes','num_critic_for_reviews','duration', 'actor_1_facebook_likes',\
           'actor_2_facebook_likes','actor_3_facebook_likes','cast_total_facebook_likes','gross','num_voted_users',\
           'num_user_for_reviews','budget']
        
violin_plot(dt, list_col)

The result of checking columns that has the number of unique values > 100:

  • Variable such as director_facebook_likes, movie_facebook_likes, num_critic_for_reviews, actor_3_facebook_likes,gross, num_voted_users, and num_user_for_reviews show a relationship with imdb rating, namely high values of these variables tends to go with high imdb rating, though still have movies that these variables are low but get high imdb rating.
  • Other variables such as duration, actor_1_facebook_likes, actor_2_facebook_likes, cast_total_facebook_likes, cast_total_facebook_likes and budget seems not affect to imdb rating
In [9]:
'''Histogram plot of target vairable: imbd_score'''
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import plotly.figure_factory as ff

x = dt['imdb_score'].values
hist_data = [x]
group_labels = ['imdb_score'] 

fig = ff.create_distplot(hist_data, group_labels, colors= ['grey'], bin_size=.1)
fig.layout.update(title='Histogram and desity plot of Imdb_score') 
fig.show()

III. Adding a new variable

The greatness of a movie is highly affected by its director, so I will add a new column that can measure this property.

    1. Do web scraping from wikipedia to get lists of winning canne directors and winning oscar directors.
    1. Merge 2 lists and remove duplicate names.
    1. Join back to the original data and create a new column with 1 means that director gets canne or oscar, and 0 is non
In [10]:
''' Do web scraping to get canne director award from wikipedia '''

res1= requests.get('https://en.wikipedia.org/wiki/Palme_d%27Or')

soup1= bs4.BeautifulSoup(res1.content)
table_rows1= soup1.select('table')[1].select('tr')[3:]

cell_values1=[]
for i in range(len(table_rows1)-1):
    
    cell_num= table_rows1[i].select('td')
    if len(cell_num) <=2:
        continue
    elif len(cell_num) ==3:
        v= table_rows1[i].select('td')[1].select('a')[0].text
        cell_values1.append(v)
    elif len(cell_num) == 4:
        v= table_rows1[i].select('td')[2].select('a')[0].text
        cell_values1.append(v)
         
    else:
        v= table_rows1[i].select('td')[3].select('a')[0].text
        cell_values1.append(v)
        

canne_director= pd.DataFrame(cell_values1, columns=['Director'])

canne_director.to_csv('canne_directors.csv') 

''' Do web scraping to get oscar director award from wikipedia  '''

res2= requests.get('https://en.wikipedia.org/wiki/List_of_Academy_Award_for_Best_Director_winners_by_age')

soup2= bs4.BeautifulSoup(res2.content)
table_rows2= soup2.select('table')[0].select('tr')[1:]

cell_values2=[]
for i in range(len(table_rows2)-1):
    v= table_rows2[i].select('td')[1].select('a')[0].text
    cell_values2.append(v)
 

oscar_director= pd.DataFrame(cell_values2, columns=['Director'])

oscar_director.to_csv('canne_directors.csv')

''' Combine unique oscar and canne director names. Create director_aw column with 1 for receiving award and 0 for non award '''
director_award = oscar_director.append(canne_director, ignore_index=True)['Director'].unique()
dt['director_aw']= np.where(dt.director_name.isin(director_award),1,0)
In [11]:
''' imdb_score and number of directors received award'''
dt.groupby('imdb_score')['director_aw'].sum().plot(kind='bar',\
                                                   title='imdb_score and number of directors received award',\
                                                   figsize=(15,7))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf33cbeb8>
In [12]:
sns.boxplot(x='director_aw', y='imdb_score', data=dt).set_title('Box plot of director award ')
Out[12]:
Text(0.5, 1.0, 'Box plot of director award ')

Generally, directors that received canne/oscar award have higher ibdb score

IV. Transform categorical variables that will use in the model

1. Content_rating columns

  • M, GP, X are ratings used in the past. M and GP was replaced by PG, and X was replaced by NC-17.

We will group these content ratings into 7 main groups

  • R: includes R, TV-MA
  • PG-13: includes PG-13, TV-14
  • G: includes G,TV-G, TV-Y, TV-Y7
  • PG: includes PG, TV-PG, M, GP
  • NC-17: includes NC-17, X
  • nan
  • Others
In [13]:
dt1= dt.copy()
# Adjust content_rating
dt1.loc[dt1.content_rating=='TV-MA','content_rating']= 'R'
dt1.loc[dt1.content_rating=='TV-14','content_rating']= 'PG-13'
dt1.loc[(dt1.content_rating=='TV-G') | (dt1.content_rating=='TV-Y') | (dt1.content_rating=='TV-Y7'),'content_rating']='G'
dt1.loc[(dt1.content_rating=='TV-PG') | (dt1.content_rating=='M') | (dt1.content_rating=='GP'),'content_rating']='PG'
dt1.loc[dt1.content_rating=='X','content_rating']='NC-17'

# which genres bring the most gross profit (G,PG,and PG-13)
dt1.groupby('content_rating')['gross'].mean().sort_values(ascending=False).plot(y='Gross Earning',kind='bar',\
                                                                               title='Gross earning by Genres',figsize=(10,6))

content_l = ['R','PG','PG-13','G','NC-17',np.nan]
dt1['content_rating'] = np.where(~(dt1.content_rating.isin(content_l)), 'others', dt1.content_rating)

dt1.content_rating.unique()
Out[13]:
array(['PG-13', nan, 'PG', 'G', 'R', 'others', 'NC-17'], dtype=object)

G, PG and PG-13 are genres that often bring higher profits

2. Language column

  • Group into English, non English and nan.
  • English language if the movies come from countries that use English as an official languge such as ['USA','UK','Canada', 'Australia','New Zealand']
In [14]:
Eng_l= ['USA','UK','Canada', 'Australia','New Zealand']
dt1.loc[(dt1.language.isnull()) & (dt1.country.isin(Eng_l)),'language']= 'English'

# Language will be transformed into one dummy variable: with 1 is English language and 0 is non English
dt1['english']= 0
dt1.loc[dt1.language=='English','english']= 1 
dt1.loc[dt1.language.isnull(),'english']= np.NaN

dt1.english.unique()
Out[14]:
array([ 1., nan,  0.])

3. Country column

  • Group into US, UK, France ,nan, and others.
In [15]:
country_l = ['USA','UK','France', np.nan]
dt1['country'] = np.where(~(dt1.country.isin(country_l)), 'others', dt1.country)
dt1.country.unique()
Out[15]:
array(['USA', 'UK', nan, 'others', 'France'], dtype=object)

4. Genres column

  • Each row of genre column contains different values, we will split these values, and get a unique genres list.
  • Create corresponding dummy variables for these values.
In [16]:
# Split genres rows into list of each value and then get its unique values. 
# Create genre dummy variables by match each element of genre rows to new dummy variables.

import numpy as np
all_genres = []
for x in dt.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
zero_matrix = np.zeros((len(dt), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
for i, gen in enumerate(dt.genres):
     indices = dummies.columns.get_indexer(gen.split('|'))
     dummies.iloc[i, indices] = 1
dummies.index = dt1.index
dt2 = dt1.join(dummies.add_prefix('genre_'))
 
dt2.loc[:, dt2.columns.str.startswith('genre_')].columns
Out[16]:
Index(['genre_Action', 'genre_Adventure', 'genre_Fantasy', 'genre_Sci-Fi',
       'genre_Thriller', 'genre_Documentary', 'genre_Romance',
       'genre_Animation', 'genre_Comedy', 'genre_Family', 'genre_Musical',
       'genre_Mystery', 'genre_Western', 'genre_Drama', 'genre_History',
       'genre_Sport', 'genre_Crime', 'genre_Horror', 'genre_War',
       'genre_Biography', 'genre_Music', 'genre_Game-Show', 'genre_Reality-TV',
       'genre_News', 'genre_Short', 'genre_Film-Noir'],
      dtype='object')
In [17]:
genre_l= [i for i in dt2.loc[:, dt2.columns.str.startswith('genre_')].columns]
dt2[genre_l].sum().sort_values(ascending=False).plot.bar(figsize=(8,6), title='Most common genres in movie' )
# genre_Game_Show, genre_Reality_TV, genre_News, genre_Short, and genre_Film-Noir account for a small number of movies (< 6)
# we can drop them
dt2= dt2.drop(['genre_Game-Show', 'genre_Reality-TV', 'genre_News', 'genre_Short', 'genre_Film-Noir'],axis= 'columns')

Drama, comedy, thriller, and action are most common genres for movies.

IV. Remove unnecessary columns

  • Our goal is to predict the Imdb rating at the time it's released, so the information that comes after such as gross earning won't be counted.
  • cast_total_facebook_likes equals to sum of cast members facebook likes, so it will have high corellation with actor1,actor2, actor3 facebook likes and does not contribute additional infomation. We can drop it.
  • Remove following variables: 'gross','movie_imdb_link','plot_keywords','director_name','actor_3_name','actor_2_name', 'actor_1_name','title_year', 'movie_title','genres','language', 'cast_total_facebook_likes'.
In [18]:
drop_l= ['gross','movie_imdb_link','plot_keywords','director_name','actor_3_name','actor_2_name','actor_1_name','title_year',\
               'movie_title','genres','language','cast_total_facebook_likes' ]
dt3= dt2.drop(drop_l,axis= 'columns')

V. Handle missing data

For missing values

  • There are some methods to handle missing values such as filling with mean for numerical variable, mode for categorical variabel, or knn imputer, though these methods often dont give a good accuracy. For this case, because missing values account for a small portion, so we can drop them.
  • After dropping all misisng values, we still have 4136 rows.
In [19]:
m= missing_values_table(dt3)
print('Number of rows before drop missing values:', dt3.shape[0])
dt4= dt3.dropna()
print('Number of rows after drop missing values:', dt4.shape[0])
Dataframe has 39 columns.
There are 14 columns that have missing values.
                          Missing Values  % of Total Values
budget                              487                9.7
aspect_ratio                        327                6.5
content_rating                      301                6.0
director_facebook_likes             103                2.1
num_critic_for_reviews               49                1.0
actor_3_facebook_likes               23                0.5
num_user_for_reviews                 21                0.4
color                                19                0.4
duration                             15                0.3
facenumber_in_poster                 13                0.3
actor_2_facebook_likes               13                0.3
actor_1_facebook_likes                7                0.1
country                               5                0.1
english                               2                0.0
Number of rows before drop missing values: 4998
Number of rows after drop missing values: 4136

VI. Create dummy variable for categorical columns

In [20]:
dt5=pd.get_dummies(dt4)
dt5.columns
Out[20]:
Index(['num_critic_for_reviews', 'duration', 'director_facebook_likes',
       'actor_3_facebook_likes', 'actor_1_facebook_likes', 'num_voted_users',
       'facenumber_in_poster', 'num_user_for_reviews', 'budget',
       'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio',
       'movie_facebook_likes', 'director_aw', 'english', 'genre_Action',
       'genre_Adventure', 'genre_Fantasy', 'genre_Sci-Fi', 'genre_Thriller',
       'genre_Documentary', 'genre_Romance', 'genre_Animation', 'genre_Comedy',
       'genre_Family', 'genre_Musical', 'genre_Mystery', 'genre_Western',
       'genre_Drama', 'genre_History', 'genre_Sport', 'genre_Crime',
       'genre_Horror', 'genre_War', 'genre_Biography', 'genre_Music',
       'color_ Black and White', 'color_Color', 'country_France', 'country_UK',
       'country_USA', 'country_others', 'content_rating_G',
       'content_rating_NC-17', 'content_rating_PG', 'content_rating_PG-13',
       'content_rating_R', 'content_rating_others'],
      dtype='object')
In [21]:
''' Dummy variables often cause multicolliner. To reduce the correlation among dummies variables, we can remove one feature
column from each type of variable'''

drop_l2= ['color_ Black and White','country_others','content_rating_others']
dt6= dt5.drop(drop_l2,axis= 'columns')
dt6.to_csv('cleaned_movie_rating.csv')

# Number of feature left
len(dt6.columns)
Out[21]:
45

VII. Analyzing features

1. Correlation map

  • Most variables dont have high correlation
In [22]:
correlations = dt6.corr()
f,ax = plt.subplots(figsize=(15,15))
matrix = np.triu(correlations)
sns.heatmap(correlations,vmin=-1, vmax=1, linewidths=.5,center= 0,mask= matrix)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf1cb14a8>

2. Feature importance

We will apply 3 ways and compare the results:

  • Random forest importance
  • Permutation importance
  • Shap importance
In [23]:
X= dt6.drop(['imdb_score'],axis= 1)
y= dt6['imdb_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
In [24]:
''' Random forest importance '''
feat_labels= X_train.columns

forest= RandomForestRegressor(n_estimators= 500, random_state= 1)
forest.fit(X_train, y_train)

importance_rdf= forest.feature_importances_ # get importance values of each feature
indices_rdf= np.argsort(importance_rdf)
plt.figure(figsize=(8,8))
plt.barh(feat_labels[indices_rdf], importance_rdf[indices_rdf])
plt.xlabel("Random Forest Feature Importance")

''' Get top 25 importance features'''
importance_rdf[indices_rdf]
feat_labels[indices_rdf]
d1= {'feature':feat_labels[indices_rdf], 'importance':importance_rdf[indices_rdf]}
rdf_imp= pd.DataFrame(d1).sort_values(by='importance', ascending=False)
rdf_25= rdf_imp.iloc[0:25,0]
In [25]:
''' Permutation importance '''
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(forest, X_test, y_test)
indices_per = perm_importance.importances_mean.argsort()
plt.figure(figsize=(8,8))
plt.barh(feat_labels[indices_per], perm_importance.importances_mean[indices_per])
plt.xlabel("Permutation Importance")

''' Get top 25 importance features'''
feats_per = {} 
for feature, importance in zip(feat_labels[indices_per], perm_importance.importances_mean[indices_per]):
    feats_per[feature] = importance #add the name/value pair 
permu_imp = pd.DataFrame(list(feats_per.items()),columns = ['feature','importance']) 
permu_25= permu_imp.sort_values(['importance'], ascending=False).iloc[0:25,0]
In [26]:
''' Shap importance'''

import shap
importance = shap.TreeExplainer(forest)
shap_values = importance.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=X_test.shape[1])

''' Get top 25 importance features'''
shap_vals= np.abs(shap_values).mean(0)
shap_fea_imp = pd.DataFrame(list(zip(X_train.columns,shap_vals)),columns=['col_name','feature_importance_vals'])
shap_fea_imp.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
shap_25= shap_fea_imp.iloc[0:25,0]
  • Num_voted_users, genres_drama, budget, duration, num_user_for_reviews are the most importance variables.
  • Take 25 variables that has highest important score from each method, combine and get their unique values. These variables are input in the models.
In [28]:
''' Selected features'''
select_fea= list(set(rdf_25).union(set(permu_25), set(shap_25)))
select_fea
Out[28]:
['country_USA',
 'num_voted_users',
 'genre_Documentary',
 'genre_Fantasy',
 'genre_Biography',
 'genre_Animation',
 'genre_Drama',
 'aspect_ratio',
 'actor_1_facebook_likes',
 'director_aw',
 'genre_Sci-Fi',
 'genre_Comedy',
 'content_rating_R',
 'num_critic_for_reviews',
 'english',
 'duration',
 'director_facebook_likes',
 'genre_Horror',
 'content_rating_PG',
 'movie_facebook_likes',
 'facenumber_in_poster',
 'genre_Thriller',
 'genre_Romance',
 'content_rating_PG-13',
 'budget',
 'num_user_for_reviews',
 'actor_2_facebook_likes',
 'actor_3_facebook_likes',
 'genre_Action']

VIII. Models

  • Test data = 30% of dataset.
  • Run regression to predict the Imdb score, using extra tree, xgboost, random forest, neural network and stacking regressor models.
  • Train data will be divided into 5 k-folds to do cross validation. Data for xgboost and neural network will be standardized before training.
  • Use GridSearchCV to select best parameters and then combine all the best models to run essemble stacking model.
In [29]:
X1= dt6[select_fea]
y1= dt6['imdb_score']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=1)
In [30]:
import warnings
warnings.filterwarnings('ignore')

stdScaler= StandardScaler()

'''Extra Tree Regressor'''
ex_tree=ExtraTreesRegressor(random_state=1)

ex_tree_grid={'criterion': ['mse', 'friedman_mse', 'mae'],
              'n_estimators':[100,200],
            
                }


''' XGBoost ''' 
#xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=1)
xgb_model = xgb.XGBRegressor(random_state=1)

xgb_pipe= Pipeline([('scaler', stdScaler), ('xgboost', xgb_model)])

xgbgrid= {
        'xgboost__objective':['reg:linear'],
        'xgboost__learning_rate':[0.0001, 0.001, 0.01, 0.1, 1.0],
        'xgboost__n_estimators':[500],
        'xgboost__reg_lamda':[1e-5, 1e-2, 0.1, 1.0, 100.0],
        'xgboost__reg_alpha':[1e-5, 1e-2, 0.1, 1.0, 100.0]
        }

''' Random forest'''

rdforest = RandomForestRegressor(random_state=1)
#rdforest_pipe= Pipeline([('pca': PCA()), ('rdforest', rdforest)])

rdforest_grid = {'n_estimators': [500], 
              'max_features': ['log2', 'sqrt'], 
              'criterion': [ 'mse','mae']
                  }


'''Neural Network'''
neuralnet= MLPRegressor(random_state=1)

nnet_pipe = Pipeline([('scaler', stdScaler), ('neural_net', neuralnet)])

nnet_grid = {
            'neural_net__solver': ['adam'],
            'neural_net__max_iter': [500],
            'neural_net__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0],
            'neural_net__hidden_layer_sizes':[(10,),(25,),(50,)],
            'neural_net__learning_rate': ['adaptive']
            }

# model name
modelnames= ['ex_tree','xgbreg','rdforest', 'neural_net']

#gridsearch list
gridList= [ex_tree_grid, xgbgrid,rdforest_grid, nnet_grid]

#List pipe
pipeList= [ex_tree, xgb_pipe, rdforest, nnet_pipe]


metrics= pd.DataFrame(data=None, columns= ['mae_train','mae_test','rmse_train','rmse_test','r2_train','r2_test'])

modelrs= pd.DataFrame(data=None, columns= ['model', 'y_train_pred','y_test_pred' ])

for pipe, grid, name in zip(pipeList, gridList, modelnames):
    
    # create stratified kfold
    k_fold = KFold(n_splits=5, random_state=1, shuffle=True).split(X1_train, y1_train)
        
    gs= GridSearchCV(pipe, grid, scoring= 'r2', cv= k_fold, n_jobs=-1) 
    
    # Fit grid search
    best_model = gs.fit(X1_train, y1_train)
        
    # get parameter and best model
    print('\nModel: ' , name,'\n')
    print(best_model.best_params_)
    y_train_pred= best_model.predict(X1_train)
    y_test_pred= best_model.predict(X1_test)
    
    
    # Metrics to evaluate the performance
    
    mae_train= mean_absolute_error(y1_train, y_train_pred)
    mae_test= mean_absolute_error(y1_test, y_test_pred)

    rmse_train= mean_squared_error(y1_train, y_train_pred, squared=True)
    rmse_test= mean_squared_error(y1_test, y_test_pred, squared=True)
    
    r2_train= r2_score(y1_train, y_train_pred)
    r2_test= r2_score(y1_test, y_test_pred)
  
    
    rs= pd.Series({'mae_train': mae_train,'mae_test': mae_test,'rmse_train': rmse_train,'rmse_test': rmse_test,\
                   'r2_train': r2_train,'r2_test':r2_test }, name= name)
    metrics= metrics.append(rs)
    
    gr= pd.Series({'model': best_model.best_estimator_, 'y_train_pred':y_train_pred, 'y_test_pred':y_test_pred}, name= name)
    modelrs = modelrs.append(gr)
Model:  ex_tree 

{'criterion': 'mse', 'n_estimators': 200}
[15:19:46] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

Model:  xgbreg 

{'xgboost__learning_rate': 0.1, 'xgboost__n_estimators': 500, 'xgboost__objective': 'reg:linear', 'xgboost__reg_alpha': 1.0, 'xgboost__reg_lamda': 1e-05}

Model:  rdforest 

{'criterion': 'mse', 'max_features': 'sqrt', 'n_estimators': 500}

Model:  neural_net 

{'neural_net__alpha': 1.0, 'neural_net__hidden_layer_sizes': (50,), 'neural_net__learning_rate': 'adaptive', 'neural_net__max_iter': 500, 'neural_net__solver': 'adam'}
In [31]:
''' Stacking model''' 

ex_treebest= ExtraTreesRegressor(criterion='mse', n_estimators= 200, random_state=1)

xgboostbest= xgb.XGBRegressor(learning_rate= 0.1, n_estimators= 500, 
                              objective='reg:squarederror',reg_alpha= 1.0, 
                              reg_lamda= 1e-05,random_state=1)

rdforestbest= RandomForestRegressor(criterion= 'mse', max_features= 'sqrt', n_estimators=500, random_state=1)

neural_netbest= MLPRegressor(alpha= 1.0, hidden_layer_sizes= (50,),
                             learning_rate= 'adaptive', max_iter= 500,
                             solver='adam', random_state=1)

estimators = [('ex_tree', ex_treebest),('xgb', xgboostbest), ('forest', rdforestbest), ('nnet', neural_netbest)]

stackreg = StackingRegressor(estimators=estimators, final_estimator=RandomForestRegressor(n_estimators=20,random_state=1))

stackreg.fit(X1_train, y1_train)
y_train_pred= stackreg.predict(X1_train)
y_test_pred= stackreg.predict(X1_test)

# Metrics to evaluate the performance

mae_train= mean_absolute_error(y1_train, y_train_pred)
mae_test= mean_absolute_error(y1_test, y_test_pred)

rmse_train= mean_squared_error(y1_train, y_train_pred, squared=True)
rmse_test= mean_squared_error(y1_test, y_test_pred, squared=True)

r2_train= r2_score(y1_train, y_train_pred)
r2_test= r2_score(y1_test, y_test_pred)

rs= pd.Series({'mae_train': mae_train,'mae_test': mae_test,'rmse_train': rmse_train,'rmse_test': rmse_test,\
               'r2_train': r2_train,'r2_test':r2_test }, name= 'stackreg')
metrics= metrics.append(rs)

gr= pd.Series({'model': best_model.best_estimator_, 'y_train_pred':y_train_pred, 'y_test_pred':y_test_pred}, name= 'stackreg')
modelrs = modelrs.append(gr)
In [32]:
''' Function to save models'''
modelnames1= ['ex_tree','xgbreg','rdforest', 'neural_net', 'stackreg']

from sklearn.externals import joblib
def savemodel(modeldf,colname,listmodel):
    for i in listmodel:
        joblib.dump(modeldf.loc[i][colname], i+'.pkl') 

savemodel(modelrs,'model',modelnames1)
metrics.to_csv('metrics.csv')

IX. Results and models evaluations

In [33]:
pd.options.display.float_format = "{:,.4f}".format
metrics
Out[33]:
mae_train mae_test rmse_train rmse_test r2_train r2_test
ex_tree 0.0000 0.4896 0.0000 0.4873 1.0000 0.6184
xgbreg 0.3158 0.4908 0.1812 0.4753 0.8409 0.6278
rdforest 0.1876 0.5057 0.0682 0.5080 0.9401 0.6022
neural_net 0.4273 0.5288 0.3604 0.5357 0.6835 0.5805
stackreg 0.2252 0.5114 0.1346 0.5110 0.8818 0.5999
  • Compare all metrics, the best model is Xgboost wih lowest MEA and RMSE, and highest R square. For this model, all input variables can explain for 62% of the change in imbd score.
  • Etra tree has R_square_train =1, and mea_train =0 indicating it's a perfect fit, which we should suspect the result. The model show overfitting problem.
In [34]:
''' Residuals plot '''

def residual_plt(modeldt, y_train, y_test):
    for i in range(len(modeldt)):
        
        plt.figure(figsize=(10,5))
        plt.scatter(modeldt.iloc[i].y_train_pred, modeldt.iloc[i].y_train_pred - y_train,c='steelblue',edgecolor='white',marker='o',s=35,alpha=0.9,label='Training data')
        plt.scatter(modeldt.iloc[i].y_test_pred, modeldt.iloc[i].y_test_pred - y_test, c='limegreen', edgecolor='white', marker='s', s=35, alpha=0.9, label='Test data')
        plt.xlabel('Predicted values')
        plt.ylabel('Residuals')
        plt.legend(loc='upper left')
        plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='black')
        plt.xlim([0, 15])
        plt.title('Residuals plot of '+ modeldt.index[i] )
        plt.show()
        

residual_plt(modelrs, y1_train, y1_test)
        

Overall, residuals of xgboost has better shape

In [35]:
''' Predicted values vs True values'''
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)


indexmovie= [i for i in y1_test.index]
movie_name= dt2.loc[indexmovie,:].movie_title
dfrs={'movie': movie_name, 'y1_test': y1_test.values}
test_score= pd.DataFrame(dfrs, index= y1_test.index)


def predict_true_plt(test_df, model_df, metrics, movie_title):
    '''
    Scatter plot  
    data: pandas data frame
    col_list: list of column name that want to be show on y_axis
    target_col: column name of target variable that want to be show on x_axis
    '''

    md_names= metrics.index
    
    for i in range(len(metrics)):
        test_df['y1_predict']= model_df.iloc[i].y_test_pred
        col= (test_df['y1_test'] - test_df['y1_predict']).abs()
        dict_of_fig = dict({
            
            "data": [{'y': test_df['y1_test'] ,
                      'x': test_df['y1_predict'] ,
                      'mode': 'markers', # marker= scatter. 'markers+text' la se hien thi text fix tren chart, not hover de hien
                      'marker': dict(size=8, color=col, colorbar= dict(title="Absolute residuals"), colorscale='Viridis', showscale=True),
                      #'text' :  movie_name
                      'hovertext' :  movie_title
                    }],        

            "layout": {'title': {'text': 'Imdb scores of ' + md_names[i] +
                                         '.MAE:' + str(np.round(metrics.iloc[i,1],3)) +
                                         '.RMSE:' + str(np.round(metrics.iloc[i,3],3)) + 
                                         '. R2:' + str(np.round(metrics.iloc[i,5],3)) 
                                 },
                      'xaxis_title':{'text': 'Predicted scores' },
                      'yaxis_title':{'text': 'Actual score' }}
                      
            
                            })
        fig = go.Figure(dict_of_fig)
        
        fig.update_layout(annotations=[
                            dict(x=test_df.loc[2295,].y1_predict ,
                                 y=test_df.loc[2295,].y1_test, 
                                 showarrow=True,
                                 text='Superbabies:Baby Geniuses 2. Pre:' +str(np.round(test_df.loc[2295,].y1_predict,1)) + ', True:'\
                                +str(np.round(test_df.loc[2295,].y1_test,1)),
                                xanchor='left',
                                yanchor='bottom'),
            
                            dict(x=test_df.loc[683,].y1_predict ,
                                 y=test_df.loc[683,].y1_test, 
                                 showarrow=True,
                                 text='Fight Club. Pre:' + str(np.round(test_df.loc[683,].y1_predict,1)) + ', True:'\
                                +str(np.round(test_df.loc[683,].y1_test,1))),                         
                            
                                         ], autosize=True)

        fig.show()
       
predict_true_plt(test_score,modelrs, metrics, movie_name)

In general, the prediction and true value forms a 45-degree line, which means their predicted and actual values are close. Movies such as Superbabies: Baby Geniuses 2 doesn’t have a good prediction, but movie such as Fight Club performs well in predicting the score. We will check how input variables affect to the output of Xgboost model.

In [36]:
''' How features affect to Xgboost results'''

xgb_model= xgb.XGBRegressor(learning_rate= 0.1, n_estimators= 500, 
                              objective='reg:squarederror',reg_alpha= 1, 
                              reg_lamda= 1e-05,random_state=1)
xgb_model.fit(X1_train, y1_train)
explainerxgb = shap.TreeExplainer(xgb_model)
shap_values_xgb = explainerxgb.shap_values(X1_test)

shap.summary_plot(shap_values_xgb, X1_test, max_display=X1_test.shape[1])

The order of y_axis indicates the importance of the features in the model, and how output changes when these feature changes its values:

  • High num_voted_users will increase the imdb score. The same with long duration.
  • High budget doesn’t mean high th imdb score.
  • If a movie has drama genres, the imdb score will increase, while horror genres will decrease imdb score.
  • If director receive oscar/canne award, imdb score is increased.
  • Content_rating PG-13 and R will decrease imdb score
  • More facenumber_in_poster will decrease the imdb score.
In [ ]:
!jupyter nbconvert  ratingCopy1.ipynb --to html